Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions
Abstract
Identifying a small panel of markers that are informative about population structure can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics, and evolutionary theory in population genetics. Traditional methods for ascertaining ancestral informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. Recently, Principal Components Analysis (PCA) has been used successfully to select SNPs that are highly correlated with the top significant principal components (PCs) without using individual ancestral information. That approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the a priori sparseness of informative markers in the genome. Through simulation of a ring population and the real global population sample HGDP, with 650K SNPs genotyped in 940 unrelated individuals, we validate the proposed algorithm at selecting the most informative markers, a small fraction of which can efficiently recover a similar underlying population structure. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset of seven continents, we demonstrate that the SNPs selected by our method are more informative and less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, efficient detection of population substructure, and ancestral inference.
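As a rough illustration of the general idea in the abstract (summarize structure with a graph Laplacian eigenvector, then pick the SNPs that track it), the following is a minimal sketch on simulated data. It is not the authors' algorithm: the Gaussian kernel on correlation distance, the median bandwidth, and the plain correlation ranking are all assumptions, and the sparsity-based selection mentioned above is not implemented. The function names `simulate`, `fiedler_vector`, and `rank_snps` are hypothetical.

```python
import numpy as np

def simulate(n_per_pop=30, n_snps=400, n_informative=40, seed=0):
    """Toy data: two populations; only the first n_informative SNPs differ."""
    rng = np.random.default_rng(seed)
    shared = rng.uniform(0.1, 0.9, n_snps)          # allele frequencies common to both
    fa, fb = shared.copy(), shared.copy()
    fa[:n_informative], fb[:n_informative] = 0.1, 0.9
    G = np.vstack([rng.binomial(2, fa, (n_per_pop, n_snps)),
                   rng.binomial(2, fb, (n_per_pop, n_snps))]).astype(float)
    labels = np.repeat([0, 1], n_per_pop)
    return G, labels

def fiedler_vector(G, sigma=None):
    """First nontrivial eigenvector of a normalized graph Laplacian on individuals."""
    X = G - G.mean(axis=0)
    sd = X.std(axis=0)
    X = X[:, sd > 0] / sd[sd > 0]                   # standardize polymorphic SNPs
    C = np.corrcoef(X)                              # individual-by-individual correlation
    dist = 1.0 - C                                  # correlation distance
    if sigma is None:                               # assumed default: median distance
        sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    W = np.exp(-dist / sigma)                       # positive weights, so the graph is connected
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    L = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))  # I - D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    return vecs[:, 1]                               # skip the trivial eigenvector

def rank_snps(G, v):
    """Rank SNPs by absolute correlation with the eigenvector v."""
    Gc = G - G.mean(axis=0)
    vc = v - v.mean()
    denom = np.linalg.norm(Gc, axis=0) * np.linalg.norm(vc)
    denom[denom == 0] = np.inf                      # monomorphic SNPs get score 0
    score = np.abs(Gc.T @ vc) / denom
    return np.argsort(-score), score

if __name__ == "__main__":
    G, labels = simulate()
    v = fiedler_vector(G)
    order, score = rank_snps(G, v)
    print("top-ranked SNP indices:", np.sort(order[:10]))
```

In this toy setting the informative SNPs are planted at the first 40 column indices, so inspecting `order[:40]` gives a quick check of whether the ranking recovers them. A real analysis would need a principled choice of graph weights and a sparsity-aware selection step.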
Similar resources
Laplacian Eigenfunctions Learn Population Structure
Principal components analysis has been used for decades to summarize genetic variation across geographic regions and to infer population migration history. More recently, with the advent of genome-wide association studies of complex traits, it has become a commonly-used tool for detection and correction of confounding due to population structure. However, principal components are generally sens...
Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain a suitable subset of features from among all features. These methods search the feature space for feature subsets that satisfy some criteria or optimize several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, feature subsets are selected due to some measu...
Population Data on D7S2425 Marker in Five Ethnic Groups of the Iranian Population: A Highly Informative Marker for Molecular Diagnosis of ARNSHL
Background & Aims: SLC26A4 gene mutations are the second identifiable genetic cause of autosomal recessive nonsyndromic hearing loss (ARNSHL) after GJB2 mutations and are currently investigated in molecular diagnosis. Several potential STR markers related to this region have been introduced in databases. In this investigation, the characteristics and informativeness of D7S2425 CA repeat STR mar...
Genomic scan for signatures of selection in the native Sarabi and Taleshi cattle breeds of Iran
The aim of this study was to find footprints of selection in the native Sarabi and Taleshi cattle breeds. A total of 296 cattle from the two breeds were sampled and genotyped with a 40K microarray from Illumina. 43 animals were removed because their ACR was below 0.09. Markers were filtered using a minor allele frequency (MAF) threshold of 0.01 and a Hardy-Weinberg equilibrium test threshold of 10^-6. After filtering, 28782 marker...
Graphic analysis of population structure on genome-wide rheumatoid arthritis data
Principal-component analysis (PCA) has been used for decades to summarize the human genetic variation across geographic regions and to infer population migration history. Reduction of spurious associations due to population structure is crucial for the success of disease association studies. Recently, PCA has also become a popular method for detecting population structure and correction of popu...